Evaluating Machine Learning Models for Breast Cancer Classification¶

Breast cancer classification is a critical task in medical data science. The goal is to build a model that can accurately predict whether a tumor is benign or malignant based on a set of features. This project focuses on applying Support Vector Machine (SVM) techniques.

Objective:

Classification Task: Use SVM to predict whether a breast cancer diagnosis is malignant or benign.

Target Variable: Diagnosis classification (malignant or benign).

Evaluation Metrics:

  • Accuracy: Measure the proportion of correctly classified diagnoses.
  • Precision: Evaluate the proportion of true positives among all positive predictions.
  • Recall: Assess the proportion of true positives among all actual positives.
  • F1 Score: Compute the harmonic mean of precision and recall to balance both metrics.
  • ROC-AUC: Evaluate the model's ability to distinguish between malignant and benign cases.
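Each of these metrics is available in scikit-learn; a minimal sketch on a hypothetical set of labels and predicted probabilities (not drawn from the dataset) shows how they are computed:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground truth (1 = malignant, 0 = benign) and model outputs
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted P(malignant)

# Accuracy, precision, recall and F1 all come out to 0.75 on this toy
# example; ROC-AUC, computed from the probabilities, is 0.9375
print(f"Accuracy : {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall   : {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score : {f1_score(y_true, y_pred):.3f}")
print(f"ROC-AUC  : {roc_auc_score(y_true, y_prob):.3f}")
```

Note that ROC-AUC is computed from the predicted probabilities rather than the hard class labels, which is why the models below must expose `predict_proba`.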

Data Collection and Understanding¶

The data used in this project comes from the Breast Cancer Wisconsin (Diagnostic) Data Set.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('/content/drive/MyDrive/2023/data.csv')
In [3]:
df.shape
Out[3]:
(569, 33)

The dataset used in this analysis contains 569 rows and 33 columns. Each row represents an individual observation, while the columns correspond to various features or attributes of the data. The first five data observations are shown below:

In [4]:
df.head()
Out[4]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Examining the data types shows that all 30 measurement features are of type float64, the 'id' column is int64, and the target variable 'diagnosis' is of type object (categorical). Since the purpose of this project is classification, the target variable will be crucial for building the model. The 'id' column, which merely identifies each observation, and the entirely empty 'Unnamed: 32' column add no predictive value and will be dropped.

In [6]:
df = df.drop(['id', 'Unnamed: 32'], axis = 1)
In [7]:
df.isnull().sum()
Out[7]:
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0

After examining the dataset for missing values, it was confirmed that there are none present in any of the features. This means that the data is complete and can be used for analysis without requiring any handling of missing data.

Exploratory Data Analysis¶

To gain a better understanding of the dataset, summary statistics for the numerical features are generated below. These statistics include measures such as the mean, standard deviation, minimum, and maximum values, as well as the quartiles, providing insight into the distribution and range of the data.

In [8]:
df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
radius_mean 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
texture_mean 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
perimeter_mean 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
area_mean 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
smoothness_mean 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
compactness_mean 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
concavity_mean 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
concave points_mean 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
symmetry_mean 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
fractal_dimension_mean 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius_se 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture_se 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter_se 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area_se 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness_se 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness_se 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity_se 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points_se 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry_se 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal_dimension_se 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
texture_worst 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
perimeter_worst 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
area_worst 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
smoothness_worst 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
compactness_worst 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
concavity_worst 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
concave points_worst 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
symmetry_worst 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
fractal_dimension_worst 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750

A pairplot will be used to explore the relationships between features and their interaction with the target variable. This visualization helps to examine pairwise relationships and distributions of features within the dataset. The focus will be on a subset of features to reveal correlations and patterns with the target variable. By color-coding the points based on the target variable, any distinct patterns or separations between different classes can be assessed. This analysis aims to provide insights into potential correlations and patterns that may influence feature selection and model building processes.

In [9]:
import seaborn as sns
import matplotlib.pyplot as plt

target = df.columns[0]
features_mean = df.columns[1:11]

sns.pairplot(df, vars=features_mean, hue=target)
plt.show()
In [10]:
features_se = df.columns[11:21]

sns.pairplot(df, vars=features_se, hue=target)
plt.show()
In [11]:
features_worst = df.columns[21:31]

sns.pairplot(df, vars=features_worst, hue=target)
plt.show()

The pairplots above offer a preliminary overview of the dataset's structure and potential relationships among its features. Each off-diagonal panel shows the relationship between two features as a scatterplot, while the diagonal panels depict each feature's distribution, indicating potential skewness or outliers. Preliminary observations suggest a complex interplay between variables, with some exhibiting strong linear correlations (notably the size-related features: radius, perimeter, and area) while others show more dispersed patterns. The color differentiation within the scatterplots hints at class separation, which is a valuable indicator for the classification task.
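The linear relationships that the plots suggest can also be quantified numerically with a correlation matrix; a sketch on toy data with two strongly related columns illustrates the idea (the equivalent call on the real data would be `df[features_mean].corr()`):

```python
import pandas as pd

# Toy frame: perimeter grows linearly with radius, texture is unrelated
toy = pd.DataFrame({
    'radius':    [1.0, 2.0, 3.0, 4.0, 5.0],
    'perimeter': [6.3, 12.6, 18.8, 25.1, 31.4],
    'texture':   [9.0, 4.0, 7.0, 5.0, 8.0],
})

corr = toy.corr()
print(corr.round(2))
# radius and perimeter correlate near 1.0; texture correlates weakly with both
```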

To understand the distribution of the target variable within the dataset, a pie chart will be created. This visualization will illustrate the proportion of each class within the target variable, providing a clear view of the balance or imbalance between different classes. By examining the pie chart, it will be possible to assess the relative frequency of each diagnosis category, which can inform subsequent analysis and modeling decisions.

In [12]:
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
In [15]:
import matplotlib.pyplot as plt

diagnosis_counts = df['diagnosis'].value_counts()

plt.figure(figsize=(8, 4))
plt.pie(diagnosis_counts, labels=diagnosis_counts.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Distribution of Diagnosis')
plt.show()

The pie chart reveals that class 1 (malignant) constitutes 37.3% of the dataset, while class 0 (benign) makes up the remaining 62.7%. This distribution indicates an imbalance, with a higher proportion of benign cases compared to malignant ones. Such an imbalance may affect the performance of classification models, potentially leading to bias towards the majority class.
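The percentages shown in the pie chart can also be read directly from the label counts; a small sketch on toy labels mimicking the imbalance (the real call is `df['diagnosis'].value_counts(normalize=True)`):

```python
import pandas as pd

# Toy labels mimicking the imbalance: more benign (0) than malignant (1)
labels = pd.Series([0, 0, 0, 0, 0, 1, 1, 1])

proportions = labels.value_counts(normalize=True)
print(proportions)
# Class 0 dominates here (62.5% vs 37.5%), so a classifier that always
# predicts "benign" would already score 62.5% accuracy on this toy sample
```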

Data Preprocessing¶

Undersampling Technique¶

Due to the dataset's imbalance, with 37.3% of instances being malignant and 62.7% benign, an undersampling technique will be used to balance the class distribution. This method reduces samples from the majority class (benign) to mitigate bias and improve the model's ability to learn from both classes equally, resulting in a more balanced and fair classification model.

In [16]:
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(sampling_strategy = 'auto', random_state = 42)
X, y = undersampler.fit_resample(df.drop('diagnosis', axis = 1), df['diagnosis'])

df = pd.DataFrame(X, columns = df.columns[1:])
df['diagnosis'] = y
In [17]:
balanced_counts = y.value_counts()

plt.figure(figsize=(8, 4))
plt.pie(balanced_counts, labels=balanced_counts.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Distribution of Diagnosis after Undersampling')
plt.show()

Feature Selection¶

Feature selection will be conducted by analyzing the correlation between each feature and the target variable. Features with strong positive or negative correlations are more likely to be informative for the model, while weakly correlated features may be less useful. This approach helps in selecting the most predictive features, improving model performance and reducing dataset dimensionality.

In [18]:
corr_with_target = X.corrwith(y)

selected_features = corr_with_target[abs(corr_with_target) > 0.5].index
print(selected_features)
Index(['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
       'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
       'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst'],
      dtype='object')

Based on the correlation analysis, 15 out of 30 features were selected for their strong relationship with the target variable. This refined set of features is expected to improve the model’s performance by focusing on the most relevant inputs, reducing dimensionality, and minimizing the risk of overfitting.

Data Standardization¶

Distance- and margin-based models such as SVM and KNN are sensitive to feature scale, so the selected features are standardized to zero mean and unit variance before model building.

In [19]:
X = df[['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
       'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
       'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst']]
In [20]:
y = y.to_numpy()
In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

Model Building and Evaluation¶

In this project, six different classification methods are employed to build and evaluate the predictive model: Logistic Regression, Decision Tree Classifier, Random Forest, Support Vector Machine (SVM), Gaussian Naive Bayes (NB), and K-Nearest Neighbors (KNN). Each method is assessed using K-Fold Cross-Validation to ensure robust performance evaluation and mitigate potential overfitting. The primary metrics used for evaluation are Accuracy and Area Under the Curve (AUC), which provide insights into the model's overall performance and its ability to distinguish between classes. This comprehensive approach allows for a thorough comparison of various classification techniques to determine the most effective model for the given dataset.

In [22]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.model_selection import KFold
from sklearn.base import clone

def classification_model(model, X, y, k = 5):
  kf = KFold(n_splits = k, shuffle = True, random_state = 42)

  accuracies = []
  aucs = []

  plt.figure(figsize = (8, 4))
  mean_fpr = np.linspace(0, 1, 100)
  tprs = []

  for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    clf = clone(model)
    clf.fit(X_train, y_train)

    y_prob = clf.predict_proba(X_test)[:, 1]
    y_pred = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    accuracies.append(accuracy)
    aucs.append(roc_auc)

    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    plt.plot(fpr, tpr, lw = 2, alpha = 0.6, label = f'ROC fold {i+1} (AUC = {roc_auc:.2f})')

  mean_tpr = np.mean(tprs, axis = 0)
  mean_tpr[-1] = 1.0
  mean_auc = auc(mean_fpr, mean_tpr)
  plt.plot(mean_fpr, mean_tpr, color = 'blue', label = f'Mean ROC (AUC = {mean_auc:.2f})', lw = 2, alpha = 1)

  plt.plot([0, 1], [0, 1], color = 'gray', lw = 2, linestyle = '--', label = 'Random Classifier')

  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver Operating Characteristic')
  plt.legend(loc = 'lower right')
  plt.show()

  mean_accuracy = np.mean(accuracies)
  print(f'Mean Accuracy: {mean_accuracy:.4f}\n')
  print(f'Mean AUC: {mean_auc:.4f}\n')

  for i in range(k):
    print(f'Fold{i+1}-Accuracy: {accuracies[i]:.4f}, AUC: {aucs[i]:.4f}')

  return accuracies, aucs
In [23]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9339

Mean AUC: 0.9846

Fold1-Accuracy: 0.9647, AUC: 0.9983
Fold2-Accuracy: 0.8941, AUC: 0.9770
Fold3-Accuracy: 0.9529, AUC: 0.9877
Fold4-Accuracy: 0.9529, AUC: 0.9961
Fold5-Accuracy: 0.9048, AUC: 0.9852
In [24]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9294

Mean AUC: 0.9300

Fold1-Accuracy: 0.9294, AUC: 0.9328
Fold2-Accuracy: 0.8824, AUC: 0.8785
Fold3-Accuracy: 0.9412, AUC: 0.9443
Fold4-Accuracy: 0.9176, AUC: 0.9196
Fold5-Accuracy: 0.9762, AUC: 0.9761
In [25]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9411

Mean AUC: 0.9838

Fold1-Accuracy: 0.9412, AUC: 0.9967
Fold2-Accuracy: 0.8824, AUC: 0.9611
Fold3-Accuracy: 0.9529, AUC: 0.9924
Fold4-Accuracy: 0.9647, AUC: 0.9970
Fold5-Accuracy: 0.9643, AUC: 0.9943
In [26]:
from sklearn.svm import SVC

model = SVC(probability = True)
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9363

Mean AUC: 0.9837

Fold1-Accuracy: 0.9529, AUC: 0.9972
Fold2-Accuracy: 0.9059, AUC: 0.9714
Fold3-Accuracy: 0.9412, AUC: 0.9922
Fold4-Accuracy: 0.9412, AUC: 0.9900
Fold5-Accuracy: 0.9405, AUC: 0.9898
In [27]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9174

Mean AUC: 0.9751

Fold1-Accuracy: 0.9412, AUC: 0.9872
Fold2-Accuracy: 0.9059, AUC: 0.9653
Fold3-Accuracy: 0.9059, AUC: 0.9625
Fold4-Accuracy: 0.9294, AUC: 0.9917
Fold5-Accuracy: 0.9048, AUC: 0.9858
In [28]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9316

Mean AUC: 0.9700

Fold1-Accuracy: 0.9647, AUC: 0.9961
Fold2-Accuracy: 0.9176, AUC: 0.9474
Fold3-Accuracy: 0.9412, AUC: 0.9838
Fold4-Accuracy: 0.9176, AUC: 0.9642
Fold5-Accuracy: 0.9167, AUC: 0.9744

Model Evaluation Results¶

The model evaluation results are summarized in the table below, showing the mean accuracy and mean AUC values for each classification method used. This summary provides an overall assessment of the model's performance, highlighting how well each method performs on average across the different folds of cross-validation.

Classification Model     Mean Accuracy   Mean AUC
Logistic Regression      0.9339          0.9846
Decision Tree            0.9294          0.9300
Random Forest            0.9411          0.9838
Support Vector Machine   0.9363          0.9837
Gaussian Naive Bayes     0.9174          0.9751
K-Nearest Neighbors      0.9316          0.9700

The results indicate that the Random Forest classifier achieved the highest accuracy, suggesting it is the most effective method for classifying breast cancer cases. This reflects its strong performance in correctly identifying both malignant and benign instances. Conversely, Logistic Regression recorded the highest AUC, highlighting its superior discriminatory power. This means that Logistic Regression is particularly adept at distinguishing between the two classes, providing more precise predictions on the likelihood of a diagnosis.
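One convenient way to carry out this comparison programmatically is to collect the mean scores into a DataFrame and take the argmax per metric; a sketch using the cross-validation results reported above:

```python
import pandas as pd

# Mean scores from the cross-validation runs above
results = pd.DataFrame({
    'model': ['Logistic Regression', 'Decision Tree', 'Random Forest',
              'SVM', 'Gaussian NB', 'KNN'],
    'mean_accuracy': [0.9339, 0.9294, 0.9411, 0.9363, 0.9174, 0.9316],
    'mean_auc':      [0.9846, 0.9300, 0.9838, 0.9837, 0.9751, 0.9700],
}).set_index('model')

best_by_accuracy = results['mean_accuracy'].idxmax()  # Random Forest
best_by_auc = results['mean_auc'].idxmax()            # Logistic Regression
print(f"Best accuracy: {best_by_accuracy}")
print(f"Best AUC:      {best_by_auc}")
```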

In [68]:
!jupyter nbconvert --to html /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.ipynb
[NbConvertApp] Converting notebook /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.ipynb to html
[NbConvertApp] Writing 9350995 bytes to /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.html